In the past few years the housing market in Kansas City has really taken off, so I thought it would be interesting to look at the time series of the average home selling price over the past ten years and use it to forecast where home prices are heading. Beyond the average selling price alone, I also wanted to bring in some other time series variables to aid the forecasting and get better insight into what affects home selling prices. I decided that all of the forecasts for this time series will go out 36 months, or 3 years, to show the full effect of where each forecast is headed.
Before I start the analysis I need to get the R environment ready by loading the packages, functions, and data I will need.
First I load the packages I will be using throughout the project. I also set how numbers will be displayed: the output had a tendency to switch into scientific notation, so for consistency I wanted to make sure that all numbers were printed in a similar format.
require(fma)
require(ggplot2)
require(ggthemes)
require(magrittr)
require(zoo)
require(fBasics)
require(fpp)
require(forecast)
require(knitr)
require(lmtest)
require(corrplot)
require(mvtsplot)
options("scipen"=1, "digits"=4)
The next step is to load a custom function that I have used in many different assignments throughout the year. It makes it easy to arrange several plots together, which is extremely useful when comparing multiple charts side by side.
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  numPlots = length(plots)
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  if (numPlots==1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
And lastly, before I can start the analysis I need to load the dataset I will be working with. It is a custom dataset with monthly information for the past 10 years on the average home selling price in Kansas City, which is the main variable I will be looking at throughout the project; it also contains the number of home listings in Kansas City, the number of new housing permits in Kansas City, and the Housing Price Index for the United States. After loading the dataset I immediately turn the first variable, the average home selling price in Kansas City, into a time series so that I can begin the analysis.
kchouse <- read.csv("kc_housing.csv")
kcts <- ts(kchouse$avg_price, start=c(2007,8),frequency=12)
#kcts
The analysis begins with the single-variable time series, which has already been created. The first analysis performed on the average selling price of homes in Kansas City will be a visual one; a simple time series plot will help in understanding the dataset.
kcts %>% autoplot()+
theme_minimal() +
geom_line(aes(colour="price"))+
scale_colour_calc() +
labs(title = "Average Kansas City Home Selling Price",subtitle="From 10/10/2006 to 10/9/2017",x = "Date", y = "Price")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
A lot can be learned from this first plot. The first thing to notice is the clear seasonality in the time series, which appears to follow a yearly cycle. There is also a positive trend in the dataset, but with a definite downturn between the start of the data in August of 2007 and a bottom in January of 2009, which matches the timeframe of the last financial crisis. What is very interesting is that the downturn in the housing market seems to have happened before the downturn in the stock market.
But for a few years after the crisis home prices stayed fairly stable: after the drop in 2008 they stayed relatively flat for about three years before starting their seven-plus-year rise.
Also, looking at the chart, the dataset does not appear to be stationary; in other words, its statistical properties change over time. A simple way to tell is that a stationary series fluctuates around a constant mean, which this series clearly does not. There are other visual and statistical tests that can be run to confirm this.
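One such statistical check is a unit-root test. As a sketch (assuming the tseries package, which loads alongside fpp, is available), the Augmented Dickey-Fuller and KPSS tests can be run directly on the series:

```r
# Sketch of formal stationarity tests (tseries loads with fpp).
# ADF: null hypothesis is that the series has a unit root (non-stationary).
adf.test(kcts, alternative = "stationary")

# KPSS: null hypothesis is that the series IS level-stationary,
# so a small p-value here also points to non-stationarity.
kpss.test(kcts)
```

A large ADF p-value together with a small KPSS p-value would agree with the visual impression that the series is non-stationary.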
The next visual analysis will be a decomposition analysis.
autoplot(stl(kcts, s.window="periodic", robust=TRUE))+
theme_minimal()+
labs(title = "Decomposition of Average Kansas City Home Selling Price")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
#decompose(kcts)%>%autoplot()+theme_minimal()+scale_colour_calc()
The decomposition plot shows the trend, the yearly seasonal effect, and the remainder of the time series. It confirms that there is indeed a yearly seasonality in the dataset. But it also shows that the trend is not what it originally appeared to be: in the first plot it seemed like home prices stabilized after 2008, but with the seasonality removed it is clear that prices did not stabilize and actually continued to drop until the first half of 2011. The plot also shows that there is a remainder in the dataset that is not explained by the trend and seasonality alone.
These remainders will be a bit of an issue when trying to forecast from the single time series alone, but hopefully, when the other variables are brought in later, they will help to explain them.
The next visual analysis looks at the ACF and PACF of the dataset.
p1<-ggAcf(kcts) + theme_minimal()+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+ ggtitle('ACF and PACF of Average Kansas City Home Selling Price')
p2<-ggPacf(kcts) + theme_minimal()+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+ ggtitle('')
multiplot(p1,p2, cols =1)
The ACF plot confirms that the data is highly correlated with itself, showing significant autocorrelation out to at least 24 lags. The slow, gentle decay of the ACF tells us that the data is non-stationary. The spikes in correlation at lags 12 and 24 help confirm the yearly seasonality. The plot also shows that the data is not just white noise.
The PACF plot shows that once the effect of the earlier lags has been removed there is still autocorrelation in the dataset, meaning the data has partial autocorrelation. The spikes at lags 1 and 13 also point to yearly seasonality.
Next, we look at the histogram and density plot of the dataset.
ggplot(kcts, aes(x=y)) +
geom_histogram(aes(y=..density..),binwidth=5000, fill = "goldenrod")+
geom_density(aes(y=..density.., colour ="Navy")) +
geom_vline(aes(xintercept=mean(y, na.rm=TRUE)),color="grey69", linetype="dashed", size=1)+
labs(title = "Histogram of Average Kansas City Home Selling Price", x = "Price", y = "Density") +
theme_minimal() +
scale_colour_calc() +
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
This histogram and density plot still provide a lot of information about the dataset. They show that the data is not normally distributed: it is widely dispersed, skewed to the left, and its mean is not 0.
The last visual analysis that will be done is the QQ-Plot.
y <- quantile(kcts, c(0.25, 0.75)) # Find the 1st and 3rd quartiles
x <- qnorm( c(0.25, 0.75)) # Find the matching normal values on the x-axis
slope <- diff(y) / diff(x) # Compute the line slope
int <- y[1] - slope * x[1]
ggplot() +
aes(sample=kcts) +
stat_qq(distribution=qnorm) +
geom_abline(intercept=int, slope=slope, colour = 'navy') +
ylab("Price") +
xlab("Theoretical")+
ggtitle("QQ-Plot of Average Kansas City Home Selling Price") +
scale_x_continuous()+
scale_y_continuous() +
theme_minimal() +
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
The QQ-plot shows exactly what was seen above in the histogram and density plot: the data is not normally distributed. It skews to the left and is flat relative to the reference line.
Now that the visual analysis of the dataset is done, statistical tests will be performed as well.
The first is a test for normality; from the visual analysis, we expect the data not to be normally distributed.
normalTest(kcts,method=c("jb"))
##
## Title:
## Jarque - Bera Normalality Test
##
## Test Results:
## STATISTIC:
## X-squared: 6.5054
## P VALUE:
## Asymptotic p Value: 0.03867
##
## Description:
## Sat Feb 26 10:52:27 2022 by user:
The Jarque-Bera Normality Test takes the data being normally distributed as the null hypothesis and the data not being normally distributed as the alternative. With a p-value of .03867, which is under .05, we reject the null hypothesis and conclude that the dataset is not normally distributed.
The next statistical test will be for autocorrelation in the first 24 lags of the data. The visual analysis showed that there was autocorrelation in the dataset.
Box.test(kcts, lag=24, fitdf=0, type="Lj")
##
## Box-Ljung test
##
## data: kcts
## X-squared = 1100, df = 24, p-value <2e-16
With the Box-Ljung test, the null hypothesis is that the data has no autocorrelation and the alternative is that it does. With a p-value well under .05, at less than 2e-16, we reject the null hypothesis and conclude that the data is autocorrelated.
Now that it has been determined that the dataset is neither stationary nor normally distributed, one of the easiest ways to address this is by taking the first difference of the dataset. Let's do a visual analysis of the differenced average home selling price of Kansas City to see what can be found.
Just like with the first analysis, this analysis will begin with a simple time series plot.
kcts %>%diff%>% autoplot()+
theme_minimal() +
geom_line(aes(colour="price"))+
scale_colour_calc() +
labs(title = "Difference of Average Kansas City Home Selling Price",subtitle="From 10/10/2006 to 10/9/2017",x = "Date", y = "Price")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
The differenced plot of the average home selling prices in Kansas City looks completely different from the first plot; it shows the change in home price from month to month. From 2007 to 2009 the drops seem greater than the gains, and from 2013 on the gains seem greater than the drops. It is harder to make out exactly what is happening with the seasonality still present, but the dataset does now appear to be stationary, since the mean looks close to 0.
The next analysis decomposes the differenced dataset.
autoplot(stl(diff(kcts), s.window="periodic", robust=TRUE))+theme_minimal()+
labs(title = "Decomposition of Difference of Average Kansas City Home Selling Price")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
The decomposition shows that a yearly seasonal effect still takes place. The trend is negative from essentially the start of the data until mid-2011, then positive from mid-2011 to the end of the dataset, which matches what we saw with the undifferenced selling price data: a decline until mid-2011, then an increase. There is still a good amount of remainder left that the decomposition has not accounted for, but that is to be expected.
Now the ACF and PACF of the differenced dataset.
p1 <- ggAcf(diff(kcts)) + theme_minimal()+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+ ggtitle('ACF and PACF Plots of Differenced Average Kansas City Home Selling Price')
p2<-ggPacf(diff(kcts)) + theme_minimal()+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+ ggtitle('')
multiplot(p1,p2, cols =1)
The speed at which the ACF of the differenced data drops to 0 leads me to believe that the data is now stationary, which is what we were hoping for. The spikes at lags 12 and 24 again point to the yearly seasonality. And the fact that significant autocorrelation remains means the data is not just white noise, so the series itself can be used to help predict future values.
The PACF plot shows that, once the effect of the earlier lags has been removed, the differenced data still has autocorrelation, meaning it has partial autocorrelation, which can be used in an autoregressive model. The spikes at lag 12 show the yearly seasonality as well, which will have to be accounted for before these plots can be used to determine an ARIMA model.
Now to examine the new histogram and density plot.
ggplot(diff(kcts), aes(x=y)) +
geom_histogram(aes(y=..density..),binwidth=5000, fill = "goldenrod")+
geom_density(aes(y=..density.., colour ="Navy")) +
geom_vline(aes(xintercept=mean(y, na.rm=TRUE)),color="grey69", linetype="dashed", size=1)+
labs(title = "Histogram of Difference of Average Kansas City Home Selling Price", x = "Price", y = "Density") +
theme_minimal() +
scale_colour_calc() +
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
This histogram and density plot are exactly what we want to see from a normally distributed dataset. The distribution has the classic bell shape, both sides look evenly distributed, and the mean falls just about at 0, which is what is needed for a normal distribution.
And a QQ-plot will be used for the last visual analysis of the differenced dataset.
y <- quantile(diff(kcts), c(0.25, 0.75)) # Find the 1st and 3rd quartiles
x <- qnorm( c(0.25, 0.75)) # Find the matching normal values on the x-axis
slope <- diff(y) / diff(x) # Compute the line slope
int <- y[1] - slope * x[1]
ggplot() +
aes(sample=diff(kcts)) +
stat_qq(distribution=qnorm) +
geom_abline(intercept=int, slope=slope, colour = 'navy') +
ylab("Price") +
xlab("Theoretical")+
ggtitle("QQ-Plot of Difference of Average Kansas City Home Selling Price") +
scale_x_continuous()+
scale_y_continuous() +
theme_minimal() +
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
With almost all the points of the QQ-plot landing on or close to the reference line, and no pronounced pattern in the points, this graph also indicates that the differenced dataset is now normally distributed.
With the visual analysis of the differenced dataset complete, statistical analysis will now be performed.
Again, the first test is a test for normality, which from the visual analysis we expect to show that the differenced data is normally distributed.
normalTest(diff(kcts),method=c("jb"))
##
## Title:
## Jarque - Bera Normalality Test
##
## Test Results:
## STATISTIC:
## X-squared: 1.2211
## P VALUE:
## Asymptotic p Value: 0.5431
##
## Description:
## Sat Feb 26 10:52:30 2022 by user:
With the p-value of the test, at .5431, well above the .05 threshold, we fail to reject the null hypothesis that the differenced data is normally distributed, unlike the original dataset.
Next is the test for autocorrelation in the first 24 lags of the differenced data. The visual analysis showed that autocorrelation was present in the differenced dataset.
Box.test(diff(kcts), lag=24, fitdf=0, type="Lj")
##
## Box-Ljung test
##
## data: diff(kcts)
## X-squared = 130, df = 24, p-value <2e-16
Using the Box-Ljung test, with the resulting p-value well under .05, at less than 2e-16, we reject the null hypothesis and conclude that the differenced data is autocorrelated.
Looking at and testing both the original and differenced data, we can tell that there is seasonality in the data. And since the data is monthly and the significant lags always seem to be separated by 12 lags, it can be concluded that the data has yearly seasonality. To get a better look at this seasonality I created some seasonal plots.
The first is a seasonal plot, which plots each year of data as its own line across the months of the year, so the years can be compared directly.
ggseasonplot(kcts) + labs(title="Seasonal Plot of Average Kansas City Home Selling Price",subtitle="From 10/10/2006 to 10/9/2017", y = "Price")+theme_minimal()+
theme(legend.position="right",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
Not only does this produce a visually striking graph, but a lot can be learned from it. Looking at the year-over-year information, the month with the lowest selling price appears to be February and the month with the highest appears to be July. According to this graph, the best time to sell your house is in the summer and the best time to buy is in winter.
Not surprisingly, the year with the highest average home selling price is 2007. And the year with the lowest prices is 2010, which matches what was discovered earlier in the decomposition of the time series. This also goes to show how close home prices stayed between 2008 and 2011 before they began to rise.
The other seasonal plot we will look at is a boxplot of the seasonal effect.
ggplot() +
geom_boxplot(aes(x = factor(cycle(kcts)), y = kcts, fill =factor(cycle(kcts)))) +
scale_x_discrete(labels=c("1"="Jan","2"="Feb","3"="Mar","4"= "Apr","5"="May", "6"="Jun","7"="Jul","8"="Aug","9"="Sep","10"="Oct","11"="Nov","12"="Dec"))+
theme_minimal() +
scale_colour_calc() +
labs(title = "Seasonal Boxplot of Average Kansas City Home Selling Price",subtitle="From 10/10/2006 to 10/9/2017",x = "Month", y = "Price")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
This confirms some of what was discovered in the last plot: February looks like the month with the lowest average selling price, with June traditionally the highest. What is interesting is the amount of variation the graph shows in average selling price. The variation is similar from January through May, shrinks in June, is noticeably lower in August, September, and October, and then begins to widen again as the year comes to a close.
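As a quick numeric cross-check of the boxplot, a sketch using base R on the same kcts series computes the average selling price by calendar month:

```r
# Average selling price by calendar month, across all years in the series.
monthly_means <- tapply(kcts, cycle(kcts), mean)
names(monthly_means) <- month.abb
round(sort(monthly_means))  # lowest-priced months appear first
```

Because the series starts partway through a year, some months are averaged over one more observation than others, so this is an approximation rather than a balanced comparison.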
Now that both visual and statistical testing have been completed on the dataset, I will begin creating and selecting models for forecasting, starting with the Holt-Winters model. I chose to start with Holt-Winters instead of plain Holt's method because of the seasonality in my data. For this model and the ETS model I will use the untransformed time series data.
I will start by creating 4 different Holt-Winters models: additive, additive damped, multiplicative, and multiplicative damped.
fit1 <- hw(kcts, seasonal="additive", h=36)
fit2 <- hw(kcts, seasonal="additive", damped = TRUE, h=36)
fit3 <- hw(kcts,seasonal="multiplicative",h=36)
fit4 <- hw(kcts,seasonal="multiplicative", damped=TRUE, h=36)
fit1$model
## Holt-Winters' additive method
##
## Call:
## hw(y = kcts, h = 36, seasonal = "additive")
##
## Smoothing parameters:
## alpha = 0.4028
## beta = 0.0354
## gamma = 0.0001
##
## Initial states:
## l = 180374.9332
## b = -489.709
## s = 11060 15997 6922 -2288 -6449 -14442
## -12750 -2040 -1221 -937.5 -122.4 6272
##
## sigma: 4572
##
## AIC AICc BIC
## 2614 2620 2661
fit2$model
## Damped Holt-Winters' additive method
##
## Call:
## hw(y = kcts, h = 36, seasonal = "additive", damped = TRUE)
##
## Smoothing parameters:
## alpha = 0.285
## beta = 0.0982
## gamma = 0.0001
## phi = 0.9003
##
## Initial states:
## l = 180374.8839
## b = -991.5613
## s = 11060 15997 6922 -2288 -6449 -14418
## -12715 -1759 -1151 -937.3 -122.7 5860
##
## sigma: 4539
##
## AIC AICc BIC
## 2613 2620 2663
fit3$model
## Holt-Winters' multiplicative method
##
## Call:
## hw(y = kcts, h = 36, seasonal = "multiplicative")
##
## Smoothing parameters:
## alpha = 0.2291
## beta = 0.1064
## gamma = 0.0001
##
## Initial states:
## l = 180382.5764
## b = -474.5543
## s = 1.057 1.087 1.039 0.9884 0.964 0.9218
## 0.934 0.9923 0.994 0.9926 0.9958 1.034
##
## sigma: 0.0297
##
## AIC AICc BIC
## 2649 2655 2696
fit4$model
## Damped Holt-Winters' multiplicative method
##
## Call:
## hw(y = kcts, h = 36, seasonal = "multiplicative", damped = TRUE)
##
## Smoothing parameters:
## alpha = 0.4189
## beta = 0.0353
## gamma = 0.0001
## phi = 0.9695
##
## Initial states:
## l = 180382.6574
## b = -982.3854
## s = 1.056 1.086 1.039 0.9883 0.9647 0.9224
## 0.932 0.9919 0.9948 0.9943 0.9958 1.034
##
## sigma: 0.0282
##
## AIC AICc BIC
## 2637 2643 2687
It is interesting that all of these models have an alpha under .5 and a beta under .1. Smoothing parameters this small mean the models give meaningful weight to many past observations when forecasting future points.
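To see why a small alpha implies a long memory, recall that in exponential smoothing the weight on the observation k periods back is alpha(1 - alpha)^k. A small illustrative calculation (using alpha = 0.4, close to the additive model's estimate, purely as a sketch):

```r
# Weight the level equation places on the observation k months back,
# for k = 0..11, under simple exponential smoothing with alpha = 0.4.
alpha <- 0.4
weights <- alpha * (1 - alpha)^(0:11)
round(weights, 3)
# The weights decay geometrically but slowly, so observations from many
# months back still contribute meaningfully to the smoothed level.
```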
Now that the models have been created let’s see how the models look visually.
kcts%>%autoplot +
labs(title = "Holts Winters Forecasts of Average Kansas City Home Selling Price",subtitle="From 2006 to 2021",x = "Date", y = "Price")+
#geom_line(aes(colour="Additive"),fitted(fit1))+
geom_line(aes(colour="Additive"),fit1$mean)+
#geom_line(aes(colour="Multiplicative"),fitted(fit2))+
geom_line(aes(colour="Multiplicative"),fit2$mean)+
#geom_line(aes(colour="Additive damped trend"),fitted(fit3))+
geom_line(aes(colour="Additive damped trend"),fit3$mean)+
#geom_line(aes(colour="Multiplicative damped trend"),fitted(fit4))+
geom_line(aes(colour="Multiplicative damped trend"),fit4$mean)+
theme_minimal() +
scale_colour_calc() +
theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
kcts%>%autoplot +
labs(title = "Holts Winters Forecasts of Average Kansas City Home Selling Price",subtitle="From 2015 to 2021",x = "Date", y = "Price")+
#geom_line(aes(colour="Additive"),fitted(fit1))+
geom_line(aes(colour="Additive"),fit1$mean)+
#geom_line(aes(colour="Multiplicative"),fitted(fit2))+
geom_line(aes(colour="Multiplicative"),fit2$mean)+
#geom_line(aes(colour="Additive damped trend"),fitted(fit3))+
geom_line(aes(colour="Additive damped trend"),fit3$mean)+
#geom_line(aes(colour="Multiplicative damped trend"),fitted(fit4))+
geom_line(aes(colour="Multiplicative damped trend"),fit4$mean)+
coord_cartesian(xlim = c(2015, 2021), ylim = c(175000,275000))+
theme_minimal() +
scale_colour_calc() +
theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
Of the four models, the additive damped, multiplicative, and multiplicative damped produce similar results, with the additive model increasing at a greater rate than the others. The slope of the additive model seems to match the slope of the past few years of data, while the other models are more conservative in their forecasts. But all the models seem to capture the seasonal effect very well.
To determine which model will be best to use, we'll look at the AIC, AICc, and other accuracy measures of the models.
fit1$model$aic
## [1] 2614
fit2$model$aic
## [1] 2613
fit3$model$aic
## [1] 2649
fit4$model$aic
## [1] 2637
fit1$model$aicc
## [1] 2620
fit2$model$aicc
## [1] 2620
fit3$model$aicc
## [1] 2655
fit4$model$aicc
## [1] 2643
kable(accuracy(fit1))
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 316.9 | 4257 | 3309 | 0.1632 | 1.923 | 0.3108 | -0.075 |
kable(accuracy(fit2))
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 390.6 | 4205 | 3328 | 0.1719 | 1.922 | 0.3126 | -0.033 |
kable(accuracy(fit3))
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 14.97 | 4743 | 3752 | -0.0113 | 2.145 | 0.3525 | 0.1892 |
kable(accuracy(fit4))
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 479.2 | 4406 | 3537 | 0.2269 | 2.039 | 0.3322 | -0.0468 |
The AICc is tied between the additive and additive damped models, but the AIC is slightly lower for the additive damped. The additive damped model also has a lower RMSE, while the additive has a lower MAE. It is difficult to choose the best model here, so I am going to use the ETS model to aid in the selection and see which it recommends.
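The numbers above can also be collected programmatically; this is just a convenience sketch that gathers the statistics already printed for fit1 through fit4 into one table:

```r
# Gather AIC/AICc and training-set accuracy for the four Holt-Winters fits.
hw_fits <- list(additive = fit1, additive_damped = fit2,
                multiplicative = fit3, multiplicative_damped = fit4)
data.frame(
  AIC  = sapply(hw_fits, function(f) f$model$aic),
  AICc = sapply(hw_fits, function(f) f$model$aicc),
  RMSE = sapply(hw_fits, function(f) accuracy(f)["Training set", "RMSE"]),
  MAE  = sapply(hw_fits, function(f) accuracy(f)["Training set", "MAE"])
)
```

Having the criteria side by side makes the trade-off between the additive and additive damped fits easier to see at a glance.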
The ETS framework can choose from every Holt-Winters model and more, because it can use multiplicative error and not just the additive error that hw() uses; but because the time series is not exponential in character, the ETS model will effectively be selecting from the same options that Holt-Winters provided.
For the ETS models I decided to let the function pick the best model: for the first model I have it pick based on AIC, and the second model uses the default parameters.
fit5 <- ets(kcts, ic=c("aic"))
fit5
## ETS(A,Ad,A)
##
## Call:
## ets(y = kcts, ic = c("aic"))
##
## Smoothing parameters:
## alpha = 0.2852
## beta = 0.098
## gamma = 0.0001
## phi = 0.9006
##
## Initial states:
## l = 180374.8841
## b = -991.5614
## s = 11060 15997 6922 -2288 -6449 -14418
## -12715 -1759 -1151 -937.3 -122.7 5860
##
## sigma: 4539
##
## AIC AICc BIC
## 2613 2620 2663
fit6 <- ets(kcts)
fit6
## ETS(A,Ad,A)
##
## Call:
## ets(y = kcts)
##
## Smoothing parameters:
## alpha = 0.2852
## beta = 0.098
## gamma = 0.0001
## phi = 0.9006
##
## Initial states:
## l = 180374.8841
## b = -991.5614
## s = 11060 15997 6922 -2288 -6449 -14418
## -12715 -1759 -1151 -937.3 -122.7 5860
##
## sigma: 4539
##
## AIC AICc BIC
## 2613 2620 2663
Based on AIC, the ETS model selected has additive error, additive damped trend, and additive seasonality, which is the same specification as the Holt-Winters additive damped model. And based on the default criterion, the ETS model selected has additive error, additive trend, and additive seasonality. It is good to see that the two selections line up with the same two models that were being considered earlier, but it would have been nice to have a clear-cut winner between them; still, it does seem like we are looking in the right place.
Now that we have two ETS models created, let's see how their forecasts look.
fc5<-forecast(fit5, h=36)
fc6<-forecast(fit6, h=36)
kcts%>%autoplot +
labs(title = "ETS Forecasts of Average Kansas City Home Selling Price",subtitle="From 2006 to 2021",x = "Date", y = "Price")+
#geom_line(aes(colour="ETS(AAdA)"),fitted(fit5))+
geom_line(aes(colour="ETS(AAdA)"),fc5$mean)+
#geom_line(aes(colour="ETS(AAA)"),fitted(fit6))+
geom_line(aes(colour="ETS(AAA)"),fc6$mean)+
theme_minimal() +
scale_colour_calc(breaks= c("ETS(AAdA)","ETS(AAA)")) +
theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
kcts%>%autoplot +
labs(title = "ETS Forecast of Average Kansas City Home Selling Price",subtitle="From 2015 to 2021",x = "Date", y = "Price")+
#geom_line(aes(colour="ETS(AAdA)"),fitted(fit5))+
geom_line(aes(colour="ETS(AAdA)"),fc5$mean)+
#geom_line(aes(colour="ETS(AAA)"),fitted(fit6))+
geom_line(aes(colour="ETS(AAA)"),fc6$mean)+
theme_minimal() +
scale_colour_calc(breaks= c("ETS(AAdA)","ETS(AAA)")) +
theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+
coord_cartesian(xlim = c(2015, 2021), ylim = c(175000,275000))
Just like with the Holt-Winters models, the ETS(A,A,A) forecast really takes off on what appears to be a similar trajectory to the past few years of data, while the ETS(A,Ad,A) model predicts a forecast that stays about the same across all three years.
To help pick between these two models I will again look at their accuracy.
fit5$aic
## [1] 2613
fit6$aic
## [1] 2613
kable(accuracy(fit5))
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 390.4 | 4205 | 3327 | 0.1718 | 1.922 | 0.3125 | -0.0331 |
kable(accuracy(fit6))
| | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 |
|---|---|---|---|---|---|---|---|
| Training set | 390.4 | 4205 | 3327 | 0.1718 | 1.922 | 0.3125 | -0.0331 |
Since the first ETS model was selected by AIC it makes sense that its AIC is at least as good as the second model's, but the reported AIC and accuracy measures of the two are essentially identical.
Even though they are very close, I have decided to use the second ETS model for my forecast out of the Holt-Winters and ETS candidates. I feel it could very easily have been the other model, but looking beyond the time series data, home prices have historically risen year over year, so I selected the model that I feel best reflects that.
Here is what the forecast for the ETS(A,A,A) model looks like with confidence intervals.
fc6%>%autoplot+
labs(title = "ETS(AAA) Forecast of Average Kansas City Home Selling Price",subtitle="From 2015 to 2021",x = "Date", y = "Price")+
theme_minimal() +
scale_colour_calc() +
theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+coord_cartesian(ylim = c(125000,325000))
Even though it was tough to pick between the two models, looking at the confidence intervals makes me feel better about my choice, because the possibility that home prices decline is contained within the 95% confidence interval shown.
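For reference, the numbers behind the fan chart can be pulled straight from the components of the forecast object (a sketch; fc6$mean, fc6$lower, and fc6$upper are the standard pieces that forecast() returns):

```r
# First 12 months of point forecasts with the 95% interval bounds.
fc_table <- data.frame(
  point = as.numeric(fc6$mean),
  lo95  = as.numeric(fc6$lower[, "95%"]),
  hi95  = as.numeric(fc6$upper[, "95%"])
)
head(round(fc_table), 12)
```

This makes it easy to quote a specific month's forecast and interval rather than reading values off the plot.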
For the next time series forecast model, I am going to look at is the different ARIMA models. Through the analysis of the dataset, it has been determined that once the data is differenced it becomes normalized, which the ARIMA model can account for in the model.
Even though we have already looked at two different plots of the dataset, I need to transform it once more by taking the lag-12 difference of the differenced data. This accounts for the seasonality so that we can start estimating ARIMA models.
diff(diff(kcts),12)%>% autoplot()+
theme_minimal() +
geom_line(aes(colour="price"))+
scale_colour_calc() +
labs(title = "Differenced House Prices, Seasonally Differenced at Lag 12",subtitle="From 10/10/2006 to 10/9/2017",x = "Date", y = "Price")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
The new time plot has a similar shape to the original differenced data, but taking the lag-12 difference changes it noticeably. The data still looks stationary, but the spikes are more drastic in this plot.
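As a quick sanity check on the differencing choices (my addition, assuming the forecast package is loaded as elsewhere in this project), the unit-root helpers in the forecast package suggest how many regular and seasonal differences a series needs:

```r
# Sketch: checking how many differences the series calls for.
library(forecast)

ndiffs(kcts)    # suggested number of regular (lag-1) differences
nsdiffs(kcts)   # suggested number of seasonal (lag-12) differences
```

If these return 1 and 1, they support the single regular difference plus single seasonal difference used in the plot above.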
To start estimating the ARIMA models, I will look at the ACF and PACF plots of the seasonally differenced data to help understand the autoregressive and moving average components of the model, as well as the seasonality.
p1 <- ggAcf(diff(diff(kcts),12)) + theme_minimal()+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+ ggtitle('ACF and PACF Plots of Differenced Average Kansas City Home Selling Price')
p2<-ggPacf(diff(diff(kcts),12)) + theme_minimal()+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+ ggtitle('')
multiplot(p1,p2, cols =1)
Starting with the seasonality, there are still big spikes at the 12th lag, so the seasonal difference alone was not enough to take care of the seasonality. Since we see those spikes in both the ACF and PACF plots, I think both a seasonal autoregressive term of 1, or (1,0,0)12, and a seasonal moving average of 1, or (0,0,1)12, should be tried to account for the seasonal effect.
Now for the non-seasonal part of the model: since the data was differenced once, the model will have a difference of 1. The PACF plot has significant lags out to the second position, which suggests an autoregressive effect of 2 lags, or (2,1,0), and the ACF plot has a significant lag at the first position, which suggests a moving average effect of 1 lag, or (0,1,1). It is also possible that both are needed, for an ARIMA of (2,1,1).
Combining those two parts of the model, there are 6 models to test:
(2,1,0)(1,0,0)12, (2,1,0)(0,0,1)12, (0,1,1)(1,0,0)12, (0,1,1)(0,0,1)12, (2,1,1)(1,0,0)12, (2,1,1)(0,0,1)12
Now that we have identified the models we want to take a look at, let’s create the models.
fit7<-arima(kcts, order=c(2,1,0), seasonal=c(1,0,0))
fit8<-arima(kcts, order=c(2,1,0), seasonal=c(0,0,1))
fit9<-arima(kcts, order=c(0,1,1), seasonal=c(1,0,0))
fit10<-arima(kcts, order=c(0,1,1), seasonal=c(0,0,1))
fit11<-arima(kcts, order=c(2,1,1), seasonal=c(1,0,0))
fit12<-arima(kcts, order=c(2,1,1), seasonal=c(0,0,1))
fit7##
## Call:
## arima(x = kcts, order = c(2, 1, 0), seasonal = c(1, 0, 0))
##
## Coefficients:
## ar1 ar2 sar1
## -0.610 -0.172 0.773
## s.e. 0.102 0.096 0.059
##
## sigma^2 estimated as 35412228: log likelihood = -1209, aic = 2426
fit8##
## Call:
## arima(x = kcts, order = c(2, 1, 0), seasonal = c(0, 0, 1))
##
## Coefficients:
## ar1 ar2 sma1
## -0.199 0.012 0.358
## s.e. 0.109 0.093 0.077
##
## sigma^2 estimated as 54393952: log likelihood = -1230, aic = 2467
fit9##
## Call:
## arima(x = kcts, order = c(0, 1, 1), seasonal = c(1, 0, 0))
##
## Coefficients:
## ma1 sar1
## -0.541 0.745
## s.e. 0.081 0.060
##
## sigma^2 estimated as 36447527: log likelihood = -1210, aic = 2426
fit10##
## Call:
## arima(x = kcts, order = c(0, 1, 1), seasonal = c(0, 0, 1))
##
## Coefficients:
## ma1 sma1
## -0.198 0.354
## s.e. 0.109 0.076
##
## sigma^2 estimated as 54473971: log likelihood = -1230, aic = 2465
fit11##
## Call:
## arima(x = kcts, order = c(2, 1, 1), seasonal = c(1, 0, 0))
##
## Coefficients:
## ar1 ar2 ma1 sar1
## 0.258 0.309 -0.834 0.737
## s.e. 0.334 0.226 0.282 0.063
##
## sigma^2 estimated as 35751300: log likelihood = -1209, aic = 2427
fit12##
## Call:
## arima(x = kcts, order = c(2, 1, 1), seasonal = c(0, 0, 1))
##
## Coefficients:
## ar1 ar2 ma1 sma1
## 0.611 0.042 -0.855 0.343
## s.e. 0.132 0.115 0.073 0.081
##
## sigma^2 estimated as 52236660: log likelihood = -1227, aic = 2464
Looking just at the AIC values, the models seem fairly close to each other.
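The six fits above can also be produced in a loop so the AICs land in one table. This is a sketch of an equivalent (hypothetical) refactoring, not code from the original analysis:

```r
# Sketch: fit the six candidate specifications and tabulate their AICs.
orders   <- list(c(2,1,0), c(2,1,0), c(0,1,1), c(0,1,1), c(2,1,1), c(2,1,1))
seasonal <- list(c(1,0,0), c(0,0,1), c(1,0,0), c(0,0,1), c(1,0,0), c(0,0,1))

aics <- mapply(function(o, s) AIC(arima(kcts, order = o, seasonal = s)),
               orders, seasonal)

data.frame(order    = sapply(orders,   paste, collapse = ","),
           seasonal = sapply(seasonal, paste, collapse = ","),
           AIC      = round(aics))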
Now that the models are created let’s see how all models look plotted together.
fc7<-forecast(fit7,h=36)
fc8<-forecast(fit8,h=36)
fc9<-forecast(fit9,h=36)
fc10<-forecast(fit10,h=36)
fc11<-forecast(fit11,h=36)
fc12<-forecast(fit12,h=36)
kcts%>%autoplot +
labs(title = "ARIMA Forecasts of Average Kansas City Home Selling Price",subtitle="From 2006 to 2021",x = "Date", y = "Price")+
#geom_line(aes(colour="ARIMA(2,1,0)(1,0,0)12"),fitted(fit7))+
geom_line(aes(colour="ARIMA(2,1,0)(1,0,0)12"),fc7$mean)+
#geom_line(aes(colour="ARIMA(2,1,0)(0,0,1)12"),fitted(fit8))+
geom_line(aes(colour="ARIMA(2,1,0)(0,0,1)12"),fc8$mean)+
#geom_line(aes(colour="ARIMA(0,1,1)(1,0,0)12"),fitted(fit9))+
geom_line(aes(colour="ARIMA(0,1,1)(1,0,0)12"),fc9$mean)+
#geom_line(aes(colour="ARIMA(0,1,1)(0,0,1)12"),fitted(fit10))+
geom_line(aes(colour="ARIMA(0,1,1)(0,0,1)12"),fc10$mean)+
#geom_line(aes(colour="ARIMA(2,1,1)(1,0,0)12"),fitted(fit11))+
geom_line(aes(colour="ARIMA(2,1,1)(1,0,0)12"),fc11$mean)+
#geom_line(aes(colour="ARIMA(2,1,1)(0,0,1)12"),fitted(fit12))+
geom_line(aes(colour="ARIMA(2,1,1)(0,0,1)12"),fc12$mean)+
theme_minimal() +
scale_colour_calc(breaks=c("ARIMA(2,1,0)(1,0,0)12","ARIMA(2,1,0)(0,0,1)12","ARIMA(0,1,1)(1,0,0)12","ARIMA(0,1,1)(0,0,1)12","ARIMA(2,1,1)(1,0,0)12","ARIMA(2,1,1)(0,0,1)12")) +
theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
kcts%>%autoplot +
labs(title = "ARIMA Forecasts of Average Kansas City Home Selling Price",subtitle="From 2006 to 2021",x = "Date", y = "Price")+
#geom_line(aes(colour="ARIMA(2,1,0)(1,0,0)12"),fitted(fit7))+
geom_line(aes(colour="ARIMA(2,1,0)(1,0,0)12"),fc7$mean)+
#geom_line(aes(colour="ARIMA(2,1,0)(0,0,1)12"),fitted(fit8))+
geom_line(aes(colour="ARIMA(2,1,0)(0,0,1)12"),fc8$mean)+
#geom_line(aes(colour="ARIMA(0,1,1)(1,0,0)12"),fitted(fit9))+
geom_line(aes(colour="ARIMA(0,1,1)(1,0,0)12"),fc9$mean)+
#geom_line(aes(colour="ARIMA(0,1,1)(0,0,1)12"),fitted(fit10))+
geom_line(aes(colour="ARIMA(0,1,1)(0,0,1)12"),fc10$mean)+
#geom_line(aes(colour="ARIMA(2,1,1)(1,0,0)12"),fitted(fit11))+
geom_line(aes(colour="ARIMA(2,1,1)(1,0,0)12"),fc11$mean)+
#geom_line(aes(colour="ARIMA(2,1,1)(0,0,1)12"),fitted(fit12))+
geom_line(aes(colour="ARIMA(2,1,1)(0,0,1)12"),fc12$mean)+
theme_minimal() +
scale_colour_calc(breaks=c("ARIMA(2,1,0)(1,0,0)12","ARIMA(2,1,0)(0,0,1)12","ARIMA(0,1,1)(1,0,0)12","ARIMA(0,1,1)(0,0,1)12","ARIMA(2,1,1)(1,0,0)12","ARIMA(2,1,1)(0,0,1)12")) +
theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+
coord_cartesian(xlim = c(2015, 2021), ylim = c(175000,262500))
It's interesting that all the models looked close by their AIC, but graphing them together reveals some striking differences. The three models with the seasonal autoregressive effect look a lot like the other forecasts we have seen, but for the models with the seasonal moving average effect the forecast dies off quickly. It will be interesting to see how all the models perform, but for forecasting 3 years out we might have to use the seasonal autoregressive models by default.
Now that we have seen the forecast plot, I want to look at the ACF plots of the model residuals to check whether each model has accounted for all the autocorrelation.
p1<-fit7 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMA(2,1,0)(1,0,0)12 Model", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p2<-fit8 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMA(2,1,0)(0,0,1)12 Model", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p3<-fit9 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMA(0,1,1)(1,0,0)12 Model", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p4<-fit10 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMA(0,1,1)(0,0,1)12 Model", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p5<-fit11 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMA(2,1,1)(1,0,0)12", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p6<-fit12 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMA(2,1,1)(0,0,1)12", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
multiplot(p1,p2,p3,p4,p5,p6,cols=1)
Looking at the ACF plots of the residuals, it appears that no model accounts for all the autocorrelation. That is unfortunate for the time series alone, but it may be resolved once we bring in the other variables.
To confirm that some autocorrelation remains unaccounted for, I will run the Box-Ljung test on the residuals of each ARIMA model we have created.
Box.test(residuals(fit7), lag=24, fitdf=0, type="Lj")##
## Box-Ljung test
##
## data: residuals(fit7)
## X-squared = 50, df = 24, p-value = 0.002
Box.test(residuals(fit8), lag=24, fitdf=0, type="Lj")##
## Box-Ljung test
##
## data: residuals(fit8)
## X-squared = 120, df = 24, p-value = 7e-15
Box.test(residuals(fit9), lag=24, fitdf=0, type="Lj")##
## Box-Ljung test
##
## data: residuals(fit9)
## X-squared = 65, df = 24, p-value = 1e-05
Box.test(residuals(fit10), lag=24, fitdf=0, type="Lj")##
## Box-Ljung test
##
## data: residuals(fit10)
## X-squared = 120, df = 24, p-value = 2e-15
Box.test(residuals(fit11), lag=24, fitdf=0, type="Lj")##
## Box-Ljung test
##
## data: residuals(fit11)
## X-squared = 56, df = 24, p-value = 0.0002
Box.test(residuals(fit12), lag=24, fitdf=0, type="Lj")##
## Box-Ljung test
##
## data: residuals(fit12)
## X-squared = 110, df = 24, p-value = 7e-13
In every Box-Ljung test we reject the null hypothesis, meaning there is still autocorrelation left in the residuals.
Since autocorrelation remains in the residuals, we cannot pick the best model by which one fully accounts for it, so we will have to compare the accuracy of the models instead.
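As an aside, the forecast package bundles this whole residual diagnosis into one call. A minimal sketch, using the first ARIMA fit as an example:

```r
# Sketch: checkresiduals() produces the residual time plot, ACF, histogram,
# and a Ljung-Box test in a single call. Unlike the Box.test calls above
# (which used fitdf = 0), it sets the degrees-of-freedom correction from the
# number of fitted model parameters automatically.
library(forecast)
checkresiduals(fit7)
```

Setting `fitdf` from the model makes the Ljung-Box p-values slightly more conservative, but here the conclusion would be the same.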
summary(fit7)##
## Call:
## arima(x = kcts, order = c(2, 1, 0), seasonal = c(1, 0, 0))
##
## Coefficients:
## ar1 ar2 sar1
## -0.610 -0.172 0.773
## s.e. 0.102 0.096 0.059
##
## sigma^2 estimated as 35412228: log likelihood = -1209, aic = 2426
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 342.8 5926 4624 0.1159 2.704 0.7179 -0.009737
summary(fit8)##
## Call:
## arima(x = kcts, order = c(2, 1, 0), seasonal = c(0, 0, 1))
##
## Coefficients:
## ar1 ar2 sma1
## -0.199 0.012 0.358
## s.e. 0.109 0.093 0.077
##
## sigma^2 estimated as 54393952: log likelihood = -1230, aic = 2467
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 348.3 7344 5751 0.06123 3.325 0.8927 0.0006566
summary(fit9)##
## Call:
## arima(x = kcts, order = c(0, 1, 1), seasonal = c(1, 0, 0))
##
## Coefficients:
## ma1 sar1
## -0.541 0.745
## s.e. 0.081 0.060
##
## sigma^2 estimated as 36447527: log likelihood = -1210, aic = 2426
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 431.4 6012 4552 0.151 2.674 0.7067 -0.07133
summary(fit10)##
## Call:
## arima(x = kcts, order = c(0, 1, 1), seasonal = c(0, 0, 1))
##
## Coefficients:
## ma1 sma1
## -0.198 0.354
## s.e. 0.109 0.076
##
## sigma^2 estimated as 54473971: log likelihood = -1230, aic = 2465
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 367.4 7350 5727 0.06607 3.311 0.8891 -0.001585
summary(fit11)##
## Call:
## arima(x = kcts, order = c(2, 1, 1), seasonal = c(1, 0, 0))
##
## Coefficients:
## ar1 ar2 ma1 sar1
## 0.258 0.309 -0.834 0.737
## s.e. 0.334 0.226 0.282 0.063
##
## sigma^2 estimated as 35751300: log likelihood = -1209, aic = 2427
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 503.2 5954 4533 0.1899 2.653 0.7038 -0.03588
summary(fit12)##
## Call:
## arima(x = kcts, order = c(2, 1, 1), seasonal = c(0, 0, 1))
##
## Coefficients:
## ar1 ar2 ma1 sma1
## 0.611 0.042 -0.855 0.343
## s.e. 0.132 0.115 0.073 0.081
##
## sigma^2 estimated as 52236660: log likelihood = -1227, aic = 2464
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 641.4 7197 5563 0.175 3.212 0.8636 0.0009338
Looking at AIC, the (2,1,0)(1,0,0)12 and (0,1,1)(1,0,0)12 models are the lowest of the six, and between those two the first has the lower RMSE. The remaining model with the seasonal autoregressive term also had a good AIC and RMSE, but if I had to choose one model I would go with (2,1,0)(1,0,0)12.
Just to make sure I picked a reasonable model, I am going to use the auto.arima function to see which model it selects.
fit13<-auto.arima(kcts)
fit13## Series: kcts
## ARIMA(1,1,1)(0,1,1)[12]
##
## Coefficients:
## ar1 ma1 sma1
## -0.261 -0.345 -0.833
## s.e. 0.149 0.136 0.148
##
## sigma^2 estimated as 24171419: log likelihood=-1067
## AIC=2142 AICc=2142 BIC=2153
The auto.arima function actually selected a different model, ARIMA(1,1,1)(0,1,1)[12], which adds a seasonal difference and achieves a noticeably lower AIC (2142 versus 2426); note, though, that AICs are not directly comparable across different orders of differencing. For consistency with the candidate set identified above, I will carry my hand-selected (2,1,0)(1,0,0)12 model forward.
Now that I have selected an ARIMA model I want to plot it with the confidence intervals.
fc7%>%autoplot+
labs(title = "ARIMA(2,1,0)(1,0,0)12 Forecast of Average Kansas City Home Selling Price",subtitle="From 2015 to 2021",x = "Date", y = "Price")+
theme_minimal() +
scale_colour_calc() +
theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+coord_cartesian(ylim = c(125000,325000))
I feel like the ARIMA model turned out very well; it doesn't exactly follow the slope of the data, but it still increases as time goes on. The seasonal effect seems to work well, and I like that the model was built on stationary data. Between this model, the Holt-Winters, and the ETS model, I think this ARIMA model is the best to use to forecast the average selling price of homes in Kansas City.
Now that we have explored forecasting the average home selling price in Kansas City from the time series alone, I want to try to forecast it using other variables as well. In my mind, forecasting the series by itself and forecasting it with other variables are two different exercises, so I decided to separate the analyses. The single time series gave a good, simple forecast, but now I want to see whether other variables can help examine where home selling prices are going.
With the original dataset, I collected the number of home listings in Kansas City, the number of new housing permits in Kansas City, and the US Home Price Index (HPI). I will use these variables to see whether they are helpful in predicting where home prices are going.
The first thing I want to do is plot all the time series together to perform a basic visual analysis.
kcvarts <- ts(kchouse[,2:5], start=c(2007,8),frequency=12)
p1<-kcvarts[,1] %>%autoplot+
theme_minimal() +
geom_line(aes(colour="price"))+
scale_colour_calc() +
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),axis.text.x = element_blank(),axis.title.x=element_blank())+
labs(title = "Variables of KC House Data",subtitle="From 10/10/2016 to 10/9/2017",x = "Date", y = "Avg Price")
p2<-kcvarts[,2] %>%autoplot+
theme_minimal() +
geom_line(aes(colour="price"))+
scale_colour_calc() +
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),axis.text.x = element_blank(),axis.title.x=element_blank())+labs(y = "Listing")
p3<-kcvarts[,3] %>%autoplot+
theme_minimal() +
geom_line(aes(colour="price"))+
scale_colour_calc() +
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),axis.text.x = element_blank(),axis.title.x=element_blank())+labs(y = "Permit")
p4<-kcvarts[,4] %>%autoplot+
theme_minimal() +
geom_line(aes(colour="price"))+
scale_colour_calc() +
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+labs(y = "US HPI")
multiplot(p1,p2,p3,p4,cols=1)
The results are very interesting: the number of permits and the US HPI track the original plot of the average home selling price very closely, but the number of listings behaves very differently. For the first part of the plot the number of listings follows the selling price, but once home prices pick up, the number of listings continues to drop.
Now that we have looked at the variables, let's see how they relate to the series we are trying to forecast. To do this I will use the cross-correlation function to find out at which lags the different time series are most correlated.
ccf(kchouse[,2], kchouse[,3])
ccf(kchouse[,2], kchouse[,4])
ccf(kchouse[,2], kchouse[,5])
In terms of the average selling price of a home, the number of listings is most correlated with it five lags into the future. Other than that one variable, the series are most correlated with price at lag 0, that is, in the same period.
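The lag-5 spike in the CCF plot can be double-checked by hand. This sketch (my addition) shifts one series against the other and computes plain correlations; which shift corresponds to "listings leading price" depends on the argument order in `ccf()`, so both directions are shown:

```r
# Sketch: manually verifying the lag-5 cross-correlation between price
# (column 2 of kchouse) and listings (column 3).
price   <- kchouse[, 2]
listing <- kchouse[, 3]
n <- length(price)
k <- 5

cor(price[(k + 1):n], listing[1:(n - k)])   # price led by 5 periods
cor(price[1:(n - k)], listing[(k + 1):n])   # listings led by 5 periods
```

Whichever direction matches the CCF spike tells us which series is doing the leading.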
Now that we have the other variables and know when they are most strongly correlated with price, let's begin forecasting with a time series linear model.
To make sure we are on the right path using the other variables, I want to run two quick TSLM models: one with just trend and seasonal effects, and one with trend, seasonal effects, and the other variables added in.
fit14 = tslm(kcvarts[,1]~trend+season)
fit15 = tslm(kcvarts[,1]~trend+season+kcvarts[,2:4])
summary(fit14)##
## Call:
## tslm(formula = kcvarts[, 1] ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24134 -8778 -247 5368 31834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 135302.9 4457.3 30.36 < 2e-16 ***
## trend 527.2 33.4 15.80 < 2e-16 ***
## season2 -1882.3 5632.5 -0.33 0.73890
## season3 5777.9 5632.8 1.03 0.30731
## season4 9679.2 5633.3 1.72 0.08865 .
## season5 19014.2 5634.0 3.37 0.00103 **
## season6 27784.5 5634.9 4.93 3e-06 ***
## season7 22110.0 5636.0 3.92 0.00015 ***
## season8 19623.9 5634.9 3.48 0.00072 ***
## season9 12076.0 5634.0 2.14 0.03434 *
## season10 11865.2 5633.3 2.11 0.03752 *
## season11 11910.1 5632.8 2.11 0.03680 *
## season12 10993.1 5632.5 1.95 0.05358 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12600 on 107 degrees of freedom
## Multiple R-squared: 0.744, Adjusted R-squared: 0.716
## F-statistic: 26 on 12 and 107 DF, p-value: <2e-16
summary(fit15)##
## Call:
## tslm(formula = kcvarts[, 1] ~ trend + season + kcvarts[, 2:4])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11695 -2046 199 2374 10236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 303.093 16498.132 0.02 0.98538
## trend 301.878 46.393 6.51 2.7e-09 ***
## season2 -2521.122 1765.547 -1.43 0.15630
## season3 2717.296 1833.029 1.48 0.14126
## season4 5040.063 1864.488 2.70 0.00802 **
## season5 12677.761 1906.967 6.65 1.4e-09 ***
## season6 19997.275 1982.186 10.09 < 2e-16 ***
## season7 15612.737 1944.469 8.03 1.6e-12 ***
## season8 12040.430 1979.942 6.08 2.0e-08 ***
## season9 5994.460 1917.345 3.13 0.00229 **
## season10 6587.403 1893.137 3.48 0.00073 ***
## season11 8799.967 1814.067 4.85 4.3e-06 ***
## season12 9336.104 1795.753 5.20 1.0e-06 ***
## kcvarts[, 2:4]listing 0.546 0.485 1.12 0.26328
## kcvarts[, 2:4]new_permit 91.530 43.604 2.10 0.03822 *
## kcvarts[, 2:4]us_hpi 668.291 53.708 12.44 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3940 on 104 degrees of freedom
## Multiple R-squared: 0.976, Adjusted R-squared: 0.972
## F-statistic: 278 on 15 and 104 DF, p-value: <2e-16
In just this quick comparison, the adjusted R-squared goes from .716 to .972 once we add the other variables, so adding them definitely looks beneficial.
Now that we know the other variables add value, I want to see the effect of lagging the average selling price, the number of permits, and the US HPI five periods behind the number of listings. To do this I created one set of variables with everything lagged five periods behind listings, and another with just the last five periods removed, to compare against the lagged dataset.
kcvarts2<-as.data.frame(cbind(c(kcvarts[6:120,2]),
c(kcvarts[1:115,3]),
c(kcvarts[1:115,4])))
kcvarts2<-ts(kcvarts2, start=c(2007,8),frequency=12)
kcvarts3<-kcvarts[1:115,2:4]
kcvarts3<-ts(kcvarts3, start=c(2007,8),frequency=12)
Now that we have the new variables we can test them for significance.
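The hard-coded row indices above (6:120, 1:115) can be expressed with `head()`/`tail()` so the lag length is a single parameter. A sketch of an equivalent (hypothetical) construction; `kcvarts2_alt` is an illustrative name, not from the original:

```r
# Sketch: listings shifted 5 periods ahead of permits and HPI, with the
# lag length factored out into k.
k <- 5
kcvarts2_alt <- ts(data.frame(listing    = tail(kcvarts[, 2], -k),  # drop first k obs
                              new_permit = head(kcvarts[, 3], -k),  # drop last k obs
                              us_hpi     = head(kcvarts[, 4], -k)),
                   start = c(2007, 8), frequency = 12)
```

This produces the same 115-row alignment as kcvarts2 above, but changing the lag later only requires editing `k`.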
Using these new variables I want to create two new models to compare the effects of transforming the variables.
kcts2<-ts(kcts[1:115], start=c(2007,8),frequency=12)
fit16 = tslm(kcts2~trend+season+kcvarts2)
fit17 = tslm(kcts2~trend+season+kcvarts3)
summary(fit16)##
## Call:
## tslm(formula = kcts2 ~ trend + season + kcvarts2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10592 -2259 433 2301 9500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21616.247 17522.117 -1.23 0.22025
## trend 346.698 44.139 7.85 4.9e-12 ***
## season2 -2387.238 1757.493 -1.36 0.17745
## season3 3639.881 1906.259 1.91 0.05910 .
## season4 6740.232 1984.819 3.40 0.00099 ***
## season5 14432.873 2006.843 7.19 1.2e-10 ***
## season6 22930.810 2246.747 10.21 < 2e-16 ***
## season7 19851.995 2257.701 8.79 4.7e-14 ***
## season8 15679.496 2262.808 6.93 4.3e-10 ***
## season9 9064.712 2107.709 4.30 4.0e-05 ***
## season10 8980.673 1991.553 4.51 1.8e-05 ***
## season11 9968.356 1789.882 5.57 2.2e-07 ***
## season12 9726.255 1792.875 5.42 4.1e-07 ***
## kcvarts2V1 1.037 0.469 2.21 0.02925 *
## kcvarts2V2 59.940 41.444 1.45 0.15125
## kcvarts2V3 733.452 56.911 12.89 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3920 on 99 degrees of freedom
## Multiple R-squared: 0.971, Adjusted R-squared: 0.967
## F-statistic: 225 on 15 and 99 DF, p-value: <2e-16
summary(fit17)##
## Call:
## tslm(formula = kcts2 ~ trend + season + kcvarts3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11742 -2070 287 2538 10341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4119.699 17012.587 -0.24 0.80916
## trend 307.522 47.415 6.49 3.5e-09 ***
## season2 -2500.867 1788.692 -1.40 0.16519
## season3 2948.504 1917.840 1.54 0.12738
## season4 5364.122 1967.188 2.73 0.00757 **
## season5 12788.887 2005.545 6.38 5.8e-09 ***
## season6 20361.004 2077.974 9.80 3.0e-16 ***
## season7 16300.389 2024.884 8.05 1.9e-12 ***
## season8 12201.115 2016.064 6.05 2.6e-08 ***
## season9 6108.426 1949.000 3.13 0.00227 **
## season10 6715.149 1924.518 3.49 0.00072 ***
## season11 8803.415 1838.797 4.79 5.9e-06 ***
## season12 9440.739 1820.896 5.18 1.1e-06 ***
## kcvarts3listing 0.587 0.497 1.18 0.23971
## kcvarts3new_permit 76.925 45.552 1.69 0.09442 .
## kcvarts3us_hpi 690.064 56.939 12.12 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3990 on 99 degrees of freedom
## Multiple R-squared: 0.97, Adjusted R-squared: 0.966
## F-statistic: 217 on 15 and 99 DF, p-value: <2e-16
The adjusted R-squared values of the two models are very close, so it may not be worth the trouble of lagging the data five periods behind listings.
To see whether lagging the data is actually worthwhile, I decided to remove the variables that were not significant in each model. In the first model the number of new permits was not significant, and in the second model the number of listings was not significant, so I will create two more sets of variables to test against.
kcvarts4<-as.data.frame(cbind(c(kcvarts[6:120,2]),
c(kcvarts[1:115,4])))
kcvarts4<-ts(kcvarts4, start=c(2007,8),frequency=12)
kcvarts5<-kcvarts[1:115,3:4]
kcvarts5<-ts(kcvarts5, start=c(2007,8),frequency=12)
Now that the new variables are ready, we can test the two models again.
fit18 = tslm(kcts2~trend+season+kcvarts4)
fit19 = tslm(kcts2~trend+season+kcvarts5)
summary(fit18)##
## Call:
## tslm(formula = kcts2 ~ trend + season + kcvarts4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10612 -2327 282 2508 9287
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40024.248 12108.461 -3.31 0.00132 **
## trend 384.883 35.565 10.82 < 2e-16 ***
## season2 -2265.855 1765.045 -1.28 0.20220
## season3 4524.944 1815.193 2.49 0.01432 *
## season4 7937.450 1813.786 4.38 3.0e-05 ***
## season5 15702.233 1814.577 8.65 8.8e-14 ***
## season6 24814.489 1840.702 13.48 < 2e-16 ***
## season7 21502.268 1958.749 10.98 < 2e-16 ***
## season8 17509.769 1886.070 9.28 3.7e-15 ***
## season9 10533.613 1856.928 5.67 1.4e-07 ***
## season10 10241.275 1800.444 5.69 1.3e-07 ***
## season11 10399.355 1774.509 5.86 5.9e-08 ***
## season12 10244.173 1766.313 5.80 7.8e-08 ***
## kcvarts4V1 1.420 0.389 3.65 0.00042 ***
## kcvarts4V2 803.381 30.182 26.62 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3950 on 100 degrees of freedom
## Multiple R-squared: 0.971, Adjusted R-squared: 0.967
## F-statistic: 238 on 14 and 100 DF, p-value: <2e-16
summary(fit19)##
## Call:
## tslm(formula = kcts2 ~ trend + season + kcvarts5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11826 -2119 70 2703 9615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14475.9 6514.8 2.22 0.02854 *
## trend 254.0 14.1 18.04 < 2e-16 ***
## season2 -2430.7 1791.3 -1.36 0.17784
## season3 3007.1 1921.0 1.57 0.12065
## season4 5633.9 1957.8 2.88 0.00490 **
## season5 13248.7 1971.4 6.72 1.1e-09 ***
## season6 20744.5 2056.6 10.09 < 2e-16 ***
## season7 17086.9 1916.4 8.92 2.3e-14 ***
## season8 12923.1 1925.3 6.71 1.2e-09 ***
## season9 6771.1 1870.5 3.62 0.00046 ***
## season10 7268.8 1870.5 3.89 0.00018 ***
## season11 9283.9 1796.9 5.17 1.2e-06 ***
## season12 9154.9 1808.4 5.06 1.9e-06 ***
## kcvarts5new_permit 111.7 34.9 3.20 0.00182 **
## kcvarts5us_hpi 641.7 39.7 16.15 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4000 on 100 degrees of freedom
## Multiple R-squared: 0.97, Adjusted R-squared: 0.966
## F-statistic: 231 on 14 and 100 DF, p-value: <2e-16
Again we get two models that are very close in adjusted R-squared, with the first model at .967 and the second at .966. It is also worth noting that both models have p-values well under .05, meaning both models are significant.
Between the two models, I have decided to use just the number of permits and the US HPI. Even though the other model performed slightly better, I don't think a .001 increase in adjusted R-squared is worth waiting five months to predict the average home selling price after the fact. Now that we are using just permits and the US HPI, I want to run the model one more time on the entire time series, without removing the last five periods.
fit20 = tslm(kcvarts[,1]~trend+season+kcvarts[,c(3,4)])
summary(fit20)##
## Call:
## tslm(formula = kcvarts[, 1] ~ trend + season + kcvarts[, c(3,
## 4)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11780 -2022 -122 2470 9552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17720.1 5699.7 3.11 0.00242 **
## trend 252.0 13.8 18.30 < 2e-16 ***
## season2 -2455.8 1766.8 -1.39 0.16748
## season3 2728.1 1835.3 1.49 0.14015
## season4 5283.8 1854.2 2.85 0.00527 **
## season5 13078.5 1875.7 6.97 2.9e-10 ***
## season6 20321.7 1963.6 10.35 < 2e-16 ***
## season7 16379.0 1823.5 8.98 1.2e-14 ***
## season8 12710.6 1890.5 6.72 9.5e-10 ***
## season9 6609.8 1840.0 3.59 0.00050 ***
## season10 7101.1 1839.5 3.86 0.00020 ***
## season11 9246.9 1772.3 5.22 9.2e-07 ***
## season12 9070.2 1782.4 5.09 1.6e-06 ***
## kcvarts[, c(3, 4)]new_permit 123.9 32.8 3.78 0.00026 ***
## kcvarts[, c(3, 4)]us_hpi 622.7 35.3 17.66 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3950 on 105 degrees of freedom
## Multiple R-squared: 0.975, Adjusted R-squared: 0.972
## F-statistic: 297 on 14 and 105 DF, p-value: <2e-16
I am really happy with the way the TSLM turned out. From the seasonal coefficients we can see that February is indeed the month with the lowest home selling prices and June the highest. You can also see that as the US HPI increases, so do house prices, but interestingly enough, prices also increase with new permits, which is the opposite of what I would expect.
Now that we have a new model I want to see if the residuals of the new model have any autocorrelation in the residuals.
dwtest(fit20, alt="two.sided")##
## Durbin-Watson test
##
## data: fit20
## DW = 1.8, p-value = 0.2
## alternative hypothesis: true autocorrelation is not 0
In the Durbin-Watson test the p-value is greater than .05, so we fail to reject the null hypothesis that there is no significant autocorrelation in the residuals.
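Since the Durbin-Watson statistic only examines lag-1 autocorrelation, a cross-check at higher lags seems worthwhile. A sketch using the Breusch-Godfrey test from the lmtest package (already loaded for dwtest):

```r
# Sketch: Breusch-Godfrey test for serial correlation up to lag 12 in the
# TSLM residuals, covering a full seasonal cycle rather than just lag 1.
library(lmtest)
bgtest(fit20, order = 12)
```

A large p-value here would reinforce the Durbin-Watson conclusion across the whole seasonal cycle, not just adjacent months.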
Now that we have a model, I wanted an idea of what a forecast might look like, so I used auto.arima to forecast future values of new permits and the US HPI. That said, I'm not going to compare this with the other forecasts I have created, because the error in this forecast would be much greater since it is built on forecasted regressors, but I thought it was an exercise worth trying.
kcpermit<-forecast(auto.arima(kcvarts[,3]),h=36)
kcus_hpi<-forecast(auto.arima(kcvarts[,4]),h=36)
kcvarts6<-as.data.frame(cbind(rep(0,36), rep(0,36), c(kcpermit$mean), c(kcus_hpi$mean)))
kcvarts<-ts(kcvarts6, start=c(2017,8),frequency=12)
Now that I have forecasted the regressors, I can create a forecast based on those variables.
fc20 <- forecast(fit20, newdata=kcvarts)
fc20%>%autoplot+
labs(title = "TSLM Forecast Using Permits & US HPI",subtitle="From 2015 to 2021",x = "Date", y = "Price")+
theme_minimal() +
scale_colour_calc() +
theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+coord_cartesian(ylim = c(125000,325000))
Overall the new forecast looks very good. If I were using real predictions of new housing permits and the US HPI, I think I could create a fairly close forecast of where the Kansas City average home selling price will be.
Now that I have created a TSLM model, I wanted to create an ARIMAX model. Since lagging everything behind listings did not seem to be very productive in the last model, I decided to skip that process for the ARIMAX model.
Unfortunately, the ARIMAX will not tell us which variables are significant the way the TSLM does, so I will have to run all 7 combinations of the 3 variables through the Auto ARIMA function to see which model is the most accurate.
kcvarts <- ts(kchouse[,2:5], start=c(2007,8),frequency=12)
# fit an ARIMAX for each of the 7 combinations of the 3 regressors:
# listing (column 2), new_permit (column 3), and us_hpi (column 4)
fit22<-auto.arima(kcts,xreg = kcvarts[,2])        # listing
fit23<-auto.arima(kcts,xreg = kcvarts[,3])        # new_permit
fit24<-auto.arima(kcts,xreg = kcvarts[,4])        # us_hpi
fit25<-auto.arima(kcts,xreg = kcvarts[,2:3])      # listing + new_permit
fit26<-auto.arima(kcts,xreg = kcvarts[,c(2,4)])   # listing + us_hpi
fit27<-auto.arima(kcts,xreg = kcvarts[,3:4])      # new_permit + us_hpi
fit28<-auto.arima(kcts,xreg = kcvarts[,2:4])      # all three
fit22
## Series: kcts
## Regression with ARIMA(1,1,1)(0,1,1)[12] errors
##
## Coefficients:
## ar1 ma1 sma1 xreg
## -0.268 -0.341 -0.816 0.712
## s.e. 0.149 0.137 0.138 0.910
##
## sigma^2 estimated as 24472982: log likelihood=-1067
## AIC=2143 AICc=2144 BIC=2157
fit23
## Series: kcts
## Regression with ARIMA(2,1,1)(2,0,0)[12] errors
##
## Coefficients:
## ar1 ar2 ma1 sar1 sar2 drift xreg
## 0.136 0.087 -0.675 0.359 0.466 319.3 153.17
## s.e. 0.320 0.187 0.283 0.082 0.093 719.5 72.04
##
## sigma^2 estimated as 27885348: log likelihood=-1192
## AIC=2400 AICc=2401 BIC=2422
fit24
## Series: kcts
## Regression with ARIMA(1,1,1)(2,1,1)[12] errors
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ma1 sar1 sar2 sma1 xreg
## -0.047 -0.801 -0.366 -0.053 -0.444 875.60
## s.e. 0.001 NaN NaN NaN NaN 90.44
##
## sigma^2 estimated as 21861939: log likelihood=-1057
## AIC=2129 AICc=2130 BIC=2148
fit25
## Series: kcts
## Regression with ARIMA(2,0,0)(2,0,0)[12] errors
##
## Coefficients:
## ar1 ar2 sar1 sar2 intercept listing new_permit
## 0.545 0.401 0.372 0.448 178026 0.509 117.07
## s.e. 0.090 0.091 0.082 0.087 32109 1.004 64.75
##
## sigma^2 estimated as 29005226: log likelihood=-1206
## AIC=2427 AICc=2428 BIC=2449
fit26
## Series: kcts
## Regression with ARIMA(0,1,1)(1,1,0)[12] errors
##
## Coefficients:
## ma1 sar1 listing us_hpi
## -0.834 -0.573 0.592 981.8
## s.e. 0.057 0.080 0.537 119.0
##
## sigma^2 estimated as 23104270: log likelihood=-1060
## AIC=2130 AICc=2130 BIC=2143
fit27
## Series: kcts
## Regression with ARIMA(1,1,0)(2,0,0)[12] errors
##
## Coefficients:
## ar1 sar1 sar2 new_permit us_hpi
## -0.455 0.350 0.439 89.74 805.4
## s.e. 0.089 0.082 0.088 63.01 349.3
##
## sigma^2 estimated as 28214224: log likelihood=-1193
## AIC=2397 AICc=2398 BIC=2414
fit28
## Series: kcts
## Regression with ARIMA(0,0,0)(2,1,0)[12] errors
##
## Coefficients:
## sar1 sar2 drift xreg.listing xreg.new_permit xreg.us_hpi
## -0.685 -0.273 296.45 0.539 84.08 679.07
## s.e. 0.098 0.110 50.48 0.492 43.24 55.68
##
## sigma^2 estimated as 22655118: log likelihood=-1068
## AIC=2149 AICc=2150 BIC=2168
All 7 models came out different from each other; glancing at the AICs, the models that include listing or the US HPI as a regressor (fits 22, 24, 26, and 28) score noticeably better than the other three.
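Rather than eyeballing each summary, the AICc values can also be gathered into one sorted vector; this is a minimal sketch, assuming fit22 through fit28 from above are still in the workspace:

```r
# collect the AICc of each candidate ARIMAX model and sort ascending,
# so the most parsimonious fits appear first
fits <- list(fit22 = fit22, fit23 = fit23, fit24 = fit24, fit25 = fit25,
             fit26 = fit26, fit27 = fit27, fit28 = fit28)
aicc <- sapply(fits, function(f) f$aicc)
sort(aicc)
```

Objects returned by `auto.arima` store their AICc in the `aicc` element, so no refitting is needed for the comparison.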
To see which model best accounts for the autocorrelation in the dataset, I created ACF plots of the residuals of all the models to check for remaining autocorrelation.
p1<-fit22 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMAX with Listing", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p2<-fit23 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMAX with Permit", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p3<-fit24 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMAX with US HPI", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p4<-fit25 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMAX with Listing & Permit", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p5<-fit26 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMAX with Listing & US HPI", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p6<-fit27 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMAX with Permit & US HPI", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
p7<-fit28 %>% residuals %>% ggAcf +
theme_minimal() +
scale_colour_calc() +
labs(title = "ARIMAX with Listing, Permit, & US HPI", x = "Time", y = "Residuals")+
theme(legend.position="none",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
multiplot(p1,p2,p3,p4,p5,p6,p7,cols=1)
Looking at the ACF plots, it seems like the 2nd and 4th models are the only ones without any autocorrelation in the residuals, though the other models are very close to having none. It is worth noting that this is better than the first ARIMA models I ran, where none of the models produced residuals free of autocorrelation.
Now that we have done the visual check for autocorrelation, I want to run a statistical test for autocorrelation in the residuals.
Box.test(residuals(fit22), lag=24, fitdf=0, type="Lj")
##
## Box-Ljung test
##
## data: residuals(fit22)
## X-squared = 22, df = 24, p-value = 0.6
Box.test(residuals(fit23), lag=24, fitdf=0, type="Lj")
##
## Box-Ljung test
##
## data: residuals(fit23)
## X-squared = 17, df = 24, p-value = 0.9
Box.test(residuals(fit24), lag=24, fitdf=0, type="Lj")
##
## Box-Ljung test
##
## data: residuals(fit24)
## X-squared = 14, df = 24, p-value = 0.9
Box.test(residuals(fit25), lag=24, fitdf=0, type="Lj")
##
## Box-Ljung test
##
## data: residuals(fit25)
## X-squared = 18, df = 24, p-value = 0.8
Box.test(residuals(fit26), lag=24, fitdf=0, type="Lj")
##
## Box-Ljung test
##
## data: residuals(fit26)
## X-squared = 19, df = 24, p-value = 0.8
Box.test(residuals(fit27), lag=24, fitdf=0, type="Lj")
##
## Box-Ljung test
##
## data: residuals(fit27)
## X-squared = 27, df = 24, p-value = 0.3
Box.test(residuals(fit28), lag=24, fitdf=0, type="Lj")
##
## Box-Ljung test
##
## data: residuals(fit28)
## X-squared = 32, df = 24, p-value = 0.1
Through the Box-Ljung test, it seems the models statistically do not have autocorrelation in the residuals: all seven p-values are above .05, though fit28 is borderline at 0.1.
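The seven p-values can also be pulled out in one shot instead of running `Box.test` seven times by hand; this is a sketch, assuming fit22 through fit28 from above are still in the workspace:

```r
# Box-Ljung p-values for the residuals of all seven ARIMAX models;
# values above .05 indicate no significant residual autocorrelation
fits <- list(fit22 = fit22, fit23 = fit23, fit24 = fit24, fit25 = fit25,
             fit26 = fit26, fit27 = fit27, fit28 = fit28)
sapply(fits, function(f)
  Box.test(residuals(f), lag = 24, fitdf = 0, type = "Lj")$p.value)
```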
Now that we know that most models account for the autocorrelation in the data, let’s compare the models for accuracy.
summary(fit22)
## Series: kcts
## Regression with ARIMA(1,1,1)(0,1,1)[12] errors
##
## Coefficients:
## ar1 ma1 sma1 xreg
## -0.268 -0.341 -0.816 0.712
## s.e. 0.149 0.137 0.138 0.910
##
## sigma^2 estimated as 24472982: log likelihood=-1067
## AIC=2143 AICc=2144 BIC=2157
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 1156 4583 3450 0.6225 1.989 0.3241 -0.07778
summary(fit23)
## Series: kcts
## Regression with ARIMA(2,1,1)(2,0,0)[12] errors
##
## Coefficients:
## ar1 ar2 ma1 sar1 sar2 drift xreg
## 0.136 0.087 -0.675 0.359 0.466 319.3 153.17
## s.e. 0.320 0.187 0.283 0.082 0.093 719.5 72.04
##
## sigma^2 estimated as 27885348: log likelihood=-1192
## AIC=2400 AICc=2401 BIC=2422
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 359.7 5102 4152 0.1199 2.411 0.39 -0.01376
summary(fit24)
## Series: kcts
## Regression with ARIMA(1,1,1)(2,1,1)[12] errors
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ma1 sar1 sar2 sma1 xreg
## -0.047 -0.801 -0.366 -0.053 -0.444 875.60
## s.e. 0.001 NaN NaN NaN NaN 90.44
##
## sigma^2 estimated as 21861939: log likelihood=-1057
## AIC=2129 AICc=2130 BIC=2148
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -322.9 4290 3208 -0.1865 1.837 0.3013 -0.02273
summary(fit25)
## Series: kcts
## Regression with ARIMA(2,0,0)(2,0,0)[12] errors
##
## Coefficients:
## ar1 ar2 sar1 sar2 intercept listing new_permit
## 0.545 0.401 0.372 0.448 178026 0.509 117.07
## s.e. 0.090 0.091 0.082 0.087 32109 1.004 64.75
##
## sigma^2 estimated as 29005226: log likelihood=-1206
## AIC=2427 AICc=2428 BIC=2449
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 413.5 5226 4229 0.1249 2.452 0.3973 -0.06411
summary(fit26)
## Series: kcts
## Regression with ARIMA(0,1,1)(1,1,0)[12] errors
##
## Coefficients:
## ma1 sar1 listing us_hpi
## -0.834 -0.573 0.592 981.8
## s.e. 0.057 0.080 0.537 119.0
##
## sigma^2 estimated as 23104270: log likelihood=-1060
## AIC=2130 AICc=2130 BIC=2143
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -512 4453 3376 -0.3191 1.941 0.3171 -0.05139
summary(fit27)
## Series: kcts
## Regression with ARIMA(1,1,0)(2,0,0)[12] errors
##
## Coefficients:
## ar1 sar1 sar2 new_permit us_hpi
## -0.455 0.350 0.439 89.74 805.4
## s.e. 0.089 0.082 0.088 63.01 349.3
##
## sigma^2 estimated as 28214224: log likelihood=-1193
## AIC=2397 AICc=2398 BIC=2414
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -13.74 5177 4192 -0.07204 2.428 0.3937 -0.1236
summary(fit28)
## Series: kcts
## Regression with ARIMA(0,0,0)(2,1,0)[12] errors
##
## Coefficients:
## sar1 sar2 drift xreg.listing xreg.new_permit xreg.us_hpi
## -0.685 -0.273 296.45 0.539 84.08 679.07
## s.e. 0.098 0.110 50.48 0.492 43.24 55.68
##
## sigma^2 estimated as 22655118: log likelihood=-1068
## AIC=2149 AICc=2150 BIC=2168
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -13.91 4388 3333 -0.04295 1.902 0.3131 0.0973
Looking across all the models, fits 22, 24, and 26 have the lowest AICs of the 7 models created. Of those three, fit24 has the lowest RMSE, but its standard errors came back NaN, so I will use fit26, which has the next-lowest RMSE, for the final forecast. That ARIMAX model uses listings and the US HPI to forecast future home selling prices in Kansas City.
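To double-check that choice, the training-set RMSEs can be lined up side by side with the forecast package's `accuracy()` helper; this is a sketch, assuming the low-AIC fits from above are still in the workspace:

```r
# compare training-set RMSE across the low-AIC candidates;
# accuracy() on a fitted Arima returns a matrix with a "Training set" row
fits <- list(fit22 = fit22, fit24 = fit24, fit26 = fit26)
sapply(fits, function(f) accuracy(f)["Training set", "RMSE"])
```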
Now that we know which variables we need, I am going to use the Auto ARIMA method again to produce some future values of those regressors for the forecast. Again, I know this forecast carries much more error than it appears to, but it is an exercise worth trying.
# forecast the listing and US HPI regressors 36 months ahead with Auto ARIMA
kclisting<-forecast(auto.arima(kcvarts[,2]),h=36)
kcus_hpi<-forecast(auto.arima(kcvarts[,4]),h=36)
kcvarts7<-as.data.frame(cbind(rep(0,36),c(kclisting$mean),rep(0,36),c(kcus_hpi$mean)))
kcvarts<-ts(kcvarts7, start=c(2017,8),frequency=12)
This time around we just created forecasts for the number of listings and the US HPI.
Now that we have forecasted variables we can create the last forecast using the ARIMAX model.
fc26 <- forecast(fit26, xreg=kcvarts7[,c(2,4)] , h=36)
fc26%>%autoplot+
labs(title = "ARIMAX Forecast Using Listing & US HPI",subtitle="From 2007 to 2021",x = "Date", y = "Price")+
theme_minimal() +
scale_colour_calc() +
theme(legend.position="bottom",legend.title=element_blank(),plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))+coord_cartesian(ylim = c(125000,325000))
The last forecasted model seems to be another reasonable forecast of where average prices are likely to go. Between the ARIMAX model and the TSLM, though, I prefer the TSLM. I like the LM format a lot; it is very easy to understand the effect the seasonality, trend, and other variables have on the model. And with the forecasted regressors, the TSLM produced a much narrower confidence interval, which is good in this case: it says that if we knew the true values of new permits and US HPI, we could make a forecast without much variance in the error.